Abstract

The protein folding problem has attracted increasing attention from physicists.
The problem has a flavor of statistical mechanics, but possesses the most common
feature of most biological problems: the profound effects of evolution. I will give
an introduction to the problem, and then focus on some recent work concerning
the so-called "designability principle". The designability of a structure is measured
by the number of sequences that have that structure as their unique ground state. 
Structures differ drastically in terms of their designability; highly designable struc- 
tures emerge with a number of associated sequences much larger than the average. 
These highly designable structures 1) possess "proteinlike" secondary structures and 
motifs, 2) are thermodynamically more stable, and 3) fold faster than other struc- 
tures. These results suggest that protein structures are selected in nature because 
they are readily designed and stable against mutations, and that such selection si- 
multaneously leads to thermodynamic stability and foldability. According to this 
picture, a key to the protein folding problem is to understand the emergence and 
the properties of the highly designable structures. 
1 Introduction



The word "protein" originates from the Greek word proteios which means "of 
the first rank" [1]. Indeed, proteins are building blocks and functional units 
of all biological systems. They play crucial roles in virtually all biological 
processes. Their diverse functions include enzymatic catalysis, transport and 
storage, coordinated motion, mechanical support, signal transduction, control
and regulation, and immune response. A protein consists of a chain of amino 
acids whose sequence is determined by the information in DNA/RNA. There 
are 20 natural amino acids nature uses to make up proteins. These differ in 
size and other physical and chemical properties. The most important differ- 
ence however, as far as the determination of the structure is concerned, is 
their hydrophobicity, i.e. how much they dislike water. An open protein chain, 
under normal physiological conditions, will fold into a three-dimensional con- 
figuration to perform its function. This folded functional state of the protein 
is called the native state. For single domain globular proteins which are our 
focus here, the length of the chain is of the order of 100 amino acids (from
about 30 to about 400). Longer proteins usually form multiple domains, each of
which can usually fold independently. Fig. 1 shows the globular protein
flavodoxin, whose function is to transport electrons. Like most water soluble
single domain globular proteins, it is very compact with a roughly rounded
shape. The folded geometry of the chain is best viewed in the cartoonish
ribbon diagram (Fig. 1c) of the backbone configuration (Fig. 1b). One can see
that the geometry of this protein structure is several α-helices sandwiching a
β-sheet. The folded geometries, often referred to as folds, of proteins usually
look far more regular than random, typically possessing secondary structures
(e.g., α-helices and β-sheets) and sometimes even having tertiary symmetries.
(One can recognize an approximate mirror symmetry in Fig. 1c.) One of the
main goals of the protein folding problem is to predict the three-dimensional 
folded structure for a given sequence of amino acids. 

The protein folding problem is the kind of biological problem that has an im- 
mediate appeal to physicists. A protein can fold (to its native state) and unfold 
(to a flexible open chain) reversibly by changing the temperature, pH, or the 
concentration of some denaturant in solution. The study of denaturation
of proteins can be traced back at least 70 years, to when Wu [2] pointed out
that denaturation was in fact the unfolding of the protein from "the regular
arrangement of a rigid structure to the irregular, diffuse arrangement of the
flexible open chain". A turning point was the work of Anfinsen on the so-called
"thermodynamic hypothesis" in the late 50s and early 60s. Anfinsen [3] and
later many others demonstrated that for single domain proteins 1) the infor- 
mation coded in the amino acid sequence of a protein completely determines 
its folded structure, and 2) the native state is the global minimum of the free 
energy. These conclusions should be somewhat surprising to physicists, for
the configurational "(free) energy landscape" of a heteropolymer of the size
of a protein is typically "rough", in the sense that there are typically many
metastable states, some of which have energies very close to the global
minimum. How could a protein always fold into its unique native state with the
lowest energy? The answer is evolution. Indeed, random sequences of amino
acids are usually "glassy" and usually cannot fold uniquely. But natural
proteins are not random sequences. They are a small family of sequences, selected
by nature via evolution, that have a distinct global minimum well separated
from other metastable states (Fig. 2). One might ask: what are the unique 
and yet common properties of this special ensemble of proteinlike sequences? 
In other words, can one distinguish them from other sequences without the 
arguably impossible task of constructing the entire energy landscape? The
answer lies at the heart of the question we introduce in the next paragraph,
which is the focus of this discussion.

There are about 100,000 different proteins in the human body. The number is 
much larger if we consider all natural proteins in the biological world. Protein 
structures are classified into different folds. Proteins of the same fold have 
the same major secondary structures in the same arrangement with the same 
topological connections [4], with some small variations typically in the loop 
region. So in some sense, folds are distinct templates of protein structures. 
Proteins with a close evolutionary relation often have high sequence similarity 
and share a common fold. What is intriguing is that common folds occur even 
for proteins with different evolutionary origins and biological functions. The 
number of folds is therefore much lower than the number of proteins. Shown in 
Fig. 3 is the cumulative number of solved protein domains [5] along with the 
cumulative number of folds as a function of the year. It is increasingly less likely 
that a newly solved protein structure would take a new fold. It is estimated 
that the total number of folds for all natural proteins is only about 1000 [6,7]. 
Some of the frequently observed folds, or "superfolds" [8], are shown in Fig. 4. 
Among apparent features of these folds are secondary structures, regularities, 
and symmetries. Therefore, as in the case of sequences, protein structures or 
folds are also a very special class. One might ask: Is there anything special
about natural protein folds? Are they merely an arbitrary outcome of evolution,
or is there some fundamental reason behind their selection [9]? Is the selection
of protein structures coupled with the selection of protein sequences? We will 
now address these questions via a thorough study of simple models. 



2 Simple Models and the Designability 

The dominant driving force for protein folding is the so-called hydrophobic 
force [12]. The 20 amino acids differ in their hydrophobicity and can be very 
roughly classified into two groups: hydrophobic and polar [13]. Hydrophobic 
amino acids have greasy side chains made of hydrocarbons and like to stick 
together in water to minimize their contact with water. Polar amino acids have
polar groups (with oxygen or nitrogen) in their side chains and do not mind
contact with water as much. The simplest model of protein folding
is the so-called "HP lattice model" [14], whose structures are defined on a 
lattice and whose sequences take only two "amino acids": H (hydrophobic) 
and P (polar) (see Fig. 5). The energy for a sequence folded into a structure 
is simply given by the short range contact interactions

$$H = \sum_{i<j} e_{\nu_i \nu_j}\, \Delta(r_i - r_j), \tag{1}$$

where $\Delta(r_i - r_j) = 1$ if $r_i$ and $r_j$ are adjoining lattice sites but $i$ and $j$ are not
adjacent in position along the sequence, and $\Delta(r_i - r_j) = 0$ otherwise.
Depending on the types of monomers in contact, the interaction energy $e_{\nu_i \nu_j}$ will
be $E_{HH}$, $E_{HP}$, or $E_{PP}$, corresponding to H-H, H-P, or P-P contacts, respectively
(see Fig. 5) [15]. We [16] choose these interaction parameters to satisfy the
following physical constraints: 1) compact shapes have lower energies than any
non-compact shapes; 2) H monomers are buried as much as possible, expressed
by the relation $E_{PP} > E_{HP} > E_{HH}$, which lowers the energy of configurations in
which Hs are hidden from water; 3) different types of monomers tend to
segregate, expressed by $2E_{HP} > E_{PP} + E_{HH}$. Conditions 2) and 3) were derived from
the analysis [13] of the real protein data contained in the Miyazawa-Jernigan
matrix [17] of inter-residue contact energies between different types of amino
acids. Since we consider only the compact structures all of which have the
same total number of contacts, we can freely shift and rescale the interaction
energies, leaving only one free parameter. Throughout this section, we choose
$E_{HH} = -2.3$, $E_{HP} = -1$ and $E_{PP} = 0$, which satisfy conditions 2) and 3) above.
The results are insensitive to the value of $E_{HH}$ as long as both these conditions
are satisfied. (The analysis in Ref. [13] on the interaction potential of amino
acids arrived at a form

$$e_{\mu\nu} = h_\mu + h_\nu + c(\mu, \nu), \tag{2}$$

where $h_\mu$ is the hydrophobicity of the amino acid $\mu$ and $c$ is a small mixing
term. The additive term, i.e. the hydrophobic force, dominates the potential.
The choice of $E_{HH} = -2.3$ in our study can be viewed as a result of a hydrophobic
part -2 plus a small mixing part -0.3. Several authors have investigated
the effect of the mixing contribution as a small perturbation to the additive
potential [18,19].)
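
As an illustration of Eq. (1), the following minimal sketch evaluates the contact energy of an HP sequence on a given lattice path, using the parameter values quoted above. The function name `contact_energy` and the representation of a structure as a list of integer coordinates are assumptions made for this example, not part of the original study.

```python
# Contact energies quoted in the text: E_HH = -2.3, E_HP = -1, E_PP = 0.
CONTACT_ENERGY = {
    ("H", "H"): -2.3,
    ("H", "P"): -1.0,
    ("P", "H"): -1.0,
    ("P", "P"): 0.0,
}


def contact_energy(sequence, path):
    """Eq. (1): sum contact energies over non-bonded nearest-neighbor pairs.

    sequence: string of 'H'/'P' characters.
    path: list of integer lattice coordinates (tuples), one per monomer,
          in chain order (hypothetical representation for this sketch).
    """
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    energy = 0.0
    for i in range(len(path)):
        for j in range(i + 2, len(path)):  # skip chain neighbors (j = i + 1)
            # Delta = 1 when the two sites are exactly one lattice step apart
            if sum(abs(a - b) for a, b in zip(path[i], path[j])) == 1:
                energy += CONTACT_ENERGY[(sequence[i], sequence[j])]
    return energy


# A 2 x 2 square: monomers 0 and 3 form the only non-bonded contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(contact_energy("HPPH", square))  # a single H-H contact
print(contact_energy("HPPP", square))  # a single H-P contact
```

The same function works unchanged for 3D paths, since the adjacency test only uses the sum of coordinate differences.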

We have studied the model (1) on a three dimensional cubic lattice and on a 
two dimensional square lattice [16]. For the three dimensional case, we ana- 
lyze a chain composed of 27 monomers. We consider all the structures which 
form a compact 3 x 3 x 3 cube. There are a total of 51,704 such structures
unrelated by rotational, reflection, or reverse labeling symmetries [20,16]. For
a given sequence, the ground state structure is found by calculating the
energies of all compact structures. We completely enumerate the ground states
of all $2^{27}$ possible sequences. We find that only 4.75% of the sequences have
unique ground states and thus are potential proteinlike sequences. We then 
calculate the designability of each compact structure. Specifically, we count
the number of sequences $N_S$ that have a given compact structure $S$ as their
unique ground state. We find that compact structures differ drastically in
terms of their designability, $N_S$. There are structures that can be designed
by an enormous number of sequences, and there are "poor" structures which
can only be designed by a few or even no sequences. For example, the top
structure can be designed by 3,794 different sequences ($N_S$ = 3,794), while
there are 4,256 structures for which $N_S = 0$. The number of structures having
a given $N_S$ decreases monotonically (with small fluctuations) as $N_S$ increases
(Fig. 6a). There is a long tail to the distribution. Structures contributing to
the tail of the distribution have $N_S \gg \bar{N}_S = 61.7$, where $\bar{N}_S$ is the average
number. We call these structures "highly designable" structures. The distribution
is very different from the Poisson distribution (also shown in Fig. 6a)
that would result if the compact structures were statistically equivalent. For
a Poisson distribution with a mean $\bar{N}_S = 61.7$, the probability of finding even
one structure with $N_S > 120$ is $1.76 \times 10^{-6}$.
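
The order of magnitude of this Poisson estimate can be checked with a short calculation. The sketch below sums the Poisson tail in log space and converts it to the chance that at least one of the 51,704 structures reaches the threshold. Treating the structures as independent and counting the threshold value 120 inclusively are assumptions of this example; the exact figure depends on both choices.

```python
import math


def poisson_tail(mean, k_min, k_max=2000):
    """P(X >= k_min) for a Poisson variable, summed term by term in log space."""
    total = 0.0
    for k in range(k_min, k_max + 1):
        log_pmf = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_pmf)
    return total


n_structures = 51_704  # compact 3 x 3 x 3 structures (from the text)
mean_ns = 61.7         # average designability (from the text)

# Tail probability for a single structure (inclusive threshold: assumption).
p_single = poisson_tail(mean_ns, 120)

# Chance that at least one structure reaches the threshold, assuming
# independent structures; expm1/log1p avoid round-off for tiny p_single.
p_any = -math.expm1(n_structures * math.log1p(-p_single))

print(p_single, p_any)
```

The result should come out at the $10^{-6}$ level, which is the scale of the value quoted in the text.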

The highly designable structures are, on average, thermodynamically more
stable than other structures. The stability of a structure can be characterized
by the average energy gap $\bar{\delta}_S$, averaged over the $N_S$ sequences that design the
structure. For a given sequence, the energy gap $\delta_S$ is defined as the minimum
energy difference between the ground state energy and the energy of a different
compact structure. We find that there is a marked correlation between $N_S$
and $\bar{\delta}_S$ (Fig. 6b). Highly designable structures have average gaps much larger
than those of structures with small $N_S$, and there is a sudden jump in $\bar{\delta}_S$
for structures with $N_S \approx$ 1,400. This jump is a result of two different kinds
of excitations a ground state could have. One is to break an H-H bond and
a P-P bond to form two H-P bonds, which has a (mixing) energy cost of
$2E_{HP} - E_{HH} - E_{PP} = 0.3$. The other is to change the position of an H-mer
from relatively buried to relatively exposed, so the number of H-water bonds
(the lattice sites outside the 3 x 3 x 3 cube are occupied by water molecules)
is increased. This kind of excitation has an energy $> 1$. The jump in Fig. 6b
indicates that the lowest excitations are of the first kind for $N_S$ below the
jump, but are a mixture of the first and the second kind for $N_S$ above it.

A striking feature of the highly designable structures is that they exhibit 
certain geometrical regularities that are absent from random structures and 
are reminiscent of the secondary structures in natural proteins. In Fig. 7 is 
shown the most designable structure along with a typical random structure. 
We examined the compact structures with the 10 largest $N_S$ values and found
that all have parallel running lines folded in a regular manner.

We have also studied the model on a 2D lattice. We take sequences of length
36 and fold them into compact 6 x 6 structures on the square lattice. There are
28,728 such structures unrelated by symmetries including the reverse-labeling
symmetry. In this case, we did not enumerate all $2^{36}$ sequences but randomly
sampled them to the extent where the histogram for the $N_S$'s reached a reliable
distribution. Similar to the 3D case, the $N_S$'s have a very broad distribution
(Fig. 8a). In this case the tail decays more like an exponential. The average
gap also correlates positively with $N_S$ (Fig. 8b). Again similar to the 3D case,
we observe that the highly designable structures in 2D also exhibit secondary
structures. In the 2D 6 x 6 case, where the surface-to-interior ratio approaches
that of real proteins, the highly designable structures often have bundles of
pleats and long strands, reminiscent of α-helices and β-strands in real proteins;
in addition, some of the highly designable structures have tertiary symmetries
(Fig. 9).
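
The compact structures used here are self-avoiding walks that fill the square. A minimal sketch of how such structures can be enumerated by depth-first search is given below; the function name `compact_paths` is hypothetical, and no symmetry reduction or optimization is attempted, so this is only an illustration rather than the enumeration procedure actually used.

```python
def compact_paths(side):
    """Return all directed self-avoiding walks that fill a side x side square."""
    n_sites = side * side
    paths = []

    def extend(path, visited):
        # A walk that has visited every site is a compact structure.
        if len(path) == n_sites:
            paths.append(tuple(path))
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < side and 0 <= nxt[1] < side
            if inside and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    # Try every lattice site as the position of the first monomer.
    for x in range(side):
        for y in range(side):
            extend([(x, y)], {(x, y)})
    return paths


paths_3x3 = compact_paths(3)
print(len(paths_3x3))
```

For `side = 3` the search returns 40 directed paths, i.e. 5 structures once the 8 rotations and reflections of the square are factored out. For `side = 6` the same count divided by 8 is expected to give the 57,337 structures used in Section 3, although this unoptimized search takes considerably longer at that size.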

To ensure that the above results are not an artifact of the HP model, we have
studied model (1) with 20 amino acids [21]. In this case the interaction energies
$e_{\nu_i \nu_j}$, where $\nu_i$ can now be any one of the 20 amino acids, are taken from the
Miyazawa-Jernigan matrix [17], an empirical potential between amino acids.
For the 3D 3 x 3 x 3 system and the 2D 6 x 6 system, the total numbers
of sequences are $20^{27}$ and $20^{36}$, respectively, which are impossible to enumerate.
So we randomly sampled the sequence space. Similar to the case of the
HP model, the $N_S$'s have a broad distribution in both 3D and 2D cases.
Furthermore, the $N_S$'s correlate well with the ones obtained from the HP model
(see Fig. 10). Thus the highly designable structures in the HP model are also
highly designable in the 20-letter model [22]. With 20 amino acids, there are
few sequences that will have exactly degenerate ground states. For example,
in the case of 3 x 3 x 3 about 96.7% of the sequences have unique ground
states. However, many of these ground states are almost degenerate, in the
sense that there are compact structures other than the ground state with
energies very close to the ground state energy. If we require that for a ground
state to be truly unique there should be no other states of energies within $g_c$
from the ground state energy, then the percentage of the sequences that have
unique ground states is reduced to about 30% and 8% for $g_c = 0.4 k_B T$ and
$g_c = 0.8 k_B T$, respectively.
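
The uniqueness criterion with a gap threshold can be stated compactly in code. The following sketch assumes that the energies of all compact structures for one sequence are already available as a list, in the same units as $g_c$; the function name `has_unique_ground_state` is hypothetical.

```python
def has_unique_ground_state(energies, g_c=0.0):
    """True if the lowest energy is separated from all others by more than g_c.

    energies: energies of all compact structures for one sequence.
    g_c: required gap; g_c = 0 rejects only exactly degenerate ground states.
    """
    if len(energies) < 2:
        raise ValueError("need the energies of at least two structures")
    lowest, second = sorted(energies)[:2]
    return second - lowest > g_c


example = [-5.0, -4.7, -3.0]  # made-up energies for illustration
print(has_unique_ground_state(example, g_c=0.2))
print(has_unique_ground_state(example, g_c=0.4))
```

With the made-up energies above the gap is about 0.3, so the ground state counts as unique for the smaller threshold but not for the larger one, mirroring how the fraction of acceptable sequences drops as $g_c$ grows.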



3 A Geometrical Interpretation 

A number of questions arise: Among the large number of structures, why 
are some structures highly designable? Why does designability also guarantee 
thermodynamic stability? Why do highly designable structures have geometri- 
cal regularities and even symmetries? In this section we address these questions 
by using a geometrical formulation of the protein folding problem [23]. 

As we have mentioned before, the dominant driving force for protein folding
is the hydrophobicity, i.e. the tendency for hydrophobic amino acids to hide
away from water. To model only the hydrophobic force in protein folding, one
can assign parameters $h_\nu$ to characterize the hydrophobicities of each of the 20
amino acids [24]. Each sequence of amino acids then has an associated vector
$h = (h_{\nu_1}, h_{\nu_2}, \ldots, h_{\nu_i}, \ldots, h_{\nu_N})$, where $\nu_i$ specifies the amino acid at position
$i$ of the sequence. The energy of a sequence folded into a particular structure
is taken to be the sum of the contributions from each amino acid upon burial
away from water:

$$H = -\sum_{i=1}^{N} s_i h_{\nu_i}, \tag{3}$$

where $s_i$ is a structure-dependent number characterizing the degree of burial
of the $i$-th amino acid in the chain. Eq. (3) is essentially a solvation model
[25] at the residue level and can also be obtained by taking the mixing term
of Eq. (2) to zero [18,19,23].

To simplify the discussion, let us consider only compact structures and let $s_i$
take only two values, 0 and 1, depending on whether the amino acid is on the
surface or in the core of the structure, respectively. Therefore, each compact
structure can be represented by a string $\{s_i\}$ of 0s and 1s: $s_i = 0$ if the $i$-th
amino acid is on the surface and $s_i = 1$ if it is in the core (see Fig. 11a for an
example on a lattice). Let us make a further simplification by using only two
amino acids: $\nu_i$ = H or P, and let $h_H = 1$ and $h_P = 0$. Thus a sequence
is also mapped into a string $\{\sigma_i\}$ of 0s and 1s: $\sigma_i = 1$ if $\nu_i$ = H and $\sigma_i = 0$ if
$\nu_i$ = P. Let us call this model the PH (Purely Hydrophobic) model. Assuming
every compact structure of a given size has the same numbers of surface and
core sites and noting that the term $\sum_i \sigma_i^2$ is a constant for a fixed sequence
of amino acids and does not play any role in determining the relative energies
of structures folded by the sequence, Eq. (3) is then equivalent to [23]:

$$H = \sum_{i=1}^{N} (\sigma_i - s_i)^2. \tag{4}$$

Therefore, the energy for a sequence $\sigma = \{\sigma_i\}$ folded onto a structure $s = \{s_i\}$
is simply the distance squared (or the Hamming distance in the case where
both $\{\sigma_i\}$ and $\{s_i\}$ are strings of 0s and 1s) between the two vectors $\sigma$ and $s$.
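
The equivalence between Eq. (3) and Eq. (4) can be checked numerically. Expanding the square gives $\sum_i (\sigma_i - s_i)^2 = \sum_i \sigma_i + n_c - 2 \sum_i \sigma_i s_i$ for binary strings with $n_c$ core sites, so the two energies differ only by a factor of 2 and a structure-independent constant. The sketch below verifies this for random strings; the sizes and function names are arbitrary choices made for the example.

```python
import random


def burial_energy(sigma, s):
    """Eq. (3) with h_H = 1, h_P = 0: minus the number of buried H monomers."""
    return -sum(a * b for a, b in zip(sigma, s))


def hamming_energy(sigma, s):
    """Eq. (4): squared distance between the sequence and structure strings."""
    return sum((a - b) ** 2 for a, b in zip(sigma, s))


def check_equivalence(n_sites=12, n_core=4, n_structures=20, seed=0):
    """Check Eq. (4) = 2 * Eq. (3) + constant for one random sequence."""
    rng = random.Random(seed)
    sigma = [rng.randint(0, 1) for _ in range(n_sites)]
    offset = sum(sigma) + n_core  # identical for every structure string
    for _ in range(n_structures):
        s = [1] * n_core + [0] * (n_sites - n_core)
        rng.shuffle(s)  # random string with a fixed number of core sites
        if hamming_energy(sigma, s) != 2 * burial_energy(sigma, s) + offset:
            return False
    return True


print(check_equivalence())
```

Because the map between the two energies is increasing and the same for all structures, both forms rank the structures identically for a given sequence.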

We can now formulate the designability question geometrically. We have two
ensembles or spaces: one being all the sequences $\{\sigma\}$ and the other all the
structures $\{s\}$. Both are represented by $N$-dimensional points or vectors where
$N$ is the length of the chain. The points of all the sequences are trivially
distributed in the $N$-dimensional space. In the case of the PH model, the points
representing sequences are all the vertices of an $N$-dimensional hypercube (all
possible $2^N$ strings of 0s and 1s of length $N$). On the other hand, the points
representing all the structures $\{s\}$ have a very different distribution in the
$N$-dimensional space. The $s$'s are constrained and correlated. For example,
in the case of the PH model where $s_i = 0$ or 1, not every string of 0s and
1s actually represents a structure. In fact, only a very small fraction of the
$2^N$ strings of 0s and 1s correspond to structures. If we consider only compact
structures where $\sum_i s_i = n_c$ with $n_c$ the number of core sites, then the structure
vectors $\{s\}$ cover only a small fraction of the vertices of a hyperplane in the
$N$-dimensional hypercube.

Now imagine putting all the sequences $\{\sigma\}$ and all the structures $\{s\}$ together
in the $N$-dimensional space (see Fig. 12 for a schematic illustration). (In a
more general case it would be simplest to picture if one normalizes $\{h\}$ and
$\{s\}$ so that $0 \le h_i, s_i \le 1$.) From Eq. (4), it is evident that a sequence will
have a structure as its unique ground state if and only if the sequence is
closer (measured by the distance defined by Eq. (4)) to the structure than to
any other structure. Therefore, the set of all sequences $\{\sigma(s)\}$ that uniquely
design a structure $s$ can be found by the following geometrical construction:
Draw bisector planes between $s$ and all of its neighboring structures in the
$N$-dimensional space (see Fig. 12). The volume enclosed by these planes is called
the Voronoi polytope around $s$. $\{\sigma(s)\}$ then consists of all sequences within the
Voronoi polytope. Hence, the designabilities of structures are directly related
to the distribution of the structures in the $N$-dimensional space. A structure
closely surrounded by many neighbors will have a small Voronoi polytope and
hence a low designability, while a structure far away from others will have
a large Voronoi polytope and hence a high designability. Furthermore, the
thermodynamic stability of a folded structure is directly related to the size of
its Voronoi polytope. For a sequence $\sigma$, the energy gap between the ground
state and an excited state is the difference of the squared distances between $\sigma$
and the two states (Eq. (4)). A larger Voronoi polytope implies, on average, a
larger gap as excited states can only lie outside of the Voronoi polytope of the
ground state. Thus, this geometrical representation of the problem naturally
explains the positive correlation between the thermodynamic stability and the
designability.
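
The Voronoi construction translates directly into a nearest-structure count. The sketch below enumerates all binary sequences of a (small) length and credits a structure string only when it is strictly closer than every other structure. The function name `designability` and the toy structure strings are assumptions made for illustration; they are not real lattice structures.

```python
from itertools import product


def designability(structures, n_sites):
    """Count, for each structure string, the sequences that are strictly
    closer to it (in Hamming distance) than to any other structure."""
    counts = [0] * len(structures)
    for sigma in product((0, 1), repeat=n_sites):
        dists = [sum(a != int(b) for a, b in zip(sigma, s)) for s in structures]
        best = min(dists)
        if dists.count(best) == 1:  # unique ground state
            counts[dists.index(best)] += 1
    return counts


# Two well separated toy structures share the sequence space evenly;
# sequences equidistant from both design neither.
print(designability(["1100", "0011"], 4))

# Two structures with the same string can never be unique ground states.
print(designability(["1100", "1100", "0011"], 4))
```

In the first toy case each structure is designed by the 5 sequences within distance 1 of it, and the 6 equidistant sequences are discarded. In the second case the duplicated string gets zero designability, which is the situation of structures that map into the same string.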

As a concrete example, we have studied a 2D PH model of 6 x 6 [23]. For
each compact structure, we divide the 36 sites into 16 core sites and 20 surface
sites (see Fig. 11a). Among the 28,728 compact structures unrelated by
symmetries, there are 119 that are reverse-labeling symmetric. (For a
reverse-labeling symmetric structure, $s_i = s_{N+1-i}$.) So the total number of structures
a sequence can fold onto is (28,728 - 119) x 2 + 119 = 57,337, which map
into 30,408 distinct strings. There are cases in which two or more structures
map into the same string. We call these structures degenerate structures;
a degenerate structure cannot be the unique ground state for any sequence
in the PH model. Out of the 28,728 structures, there are 9,141 nondegenerate
structures (or 18,213 out of 57,337). A histogram for the designability of
nondegenerate structures is obtained by sampling the sequence space using
19,492,200 randomly chosen sequences and is shown in Fig. 11b. The set of
highly designable structures is essentially the same as that obtained from
the HP model discussed in the previous section. To further probe how structure
vectors are distributed in the $N$-dimensional space, we measure the number
of structures, $n_s(d)$, at a Hamming distance $d$ from a given structure $s$. Note
that all the 57,337 structures are distributed on the vertices of the hyperplane
defined by $\sum_i s_i = 16$. There are a total of $C_{36}^{16}$ = 7,307,872,110 vertices in the
hyperplane. If the structure vectors were distributed uniformly on these
vertices, $n_s(d)$ would be the same for all structures and would be $n^0(d) = \rho N(d)$,
where $\rho$ = 57,337/7,307,872,110 is the average density of structures on the
hyperplane and $N(d) = C_{16}^{d/2} C_{20}^{d/2}$ is the number of vertices at distance $d$ from
a given vertex. In Fig. 13a, $n_s(d)$ is plotted for three different structures with
low, intermediate, and high designabilities, respectively, along with $n^0(d)$. We
see that a highly designable structure typically has fewer neighbors than a less
designable structure, not only at the smallest $d$s but out to $d$s of order 10-12.
Also, $n_s(d)$ is considerably larger than $n^0(d)$ for small $d$ for structures with
low designability. These results indicate that the structures are very
nonuniformly distributed and are clustered: there are highly populated regions and
lowly populated regions. A quantitative measure of the clustering environment
around a structure is the second moment of $n_s(d)$,

$$\gamma^2(s) = \langle d^2 \rangle - \langle d \rangle^2 = 4 \sum_{ij} s_i s_j C_{ij}, \tag{5}$$

where

$$C_{ij} = \langle s_i s_j \rangle - \langle s_i \rangle \langle s_j \rangle \tag{6}$$

and $\langle \cdot \rangle$ denotes average over all structures. In Fig. 14a we plot the
designability $N_S$ of a structure vs. its $\gamma$. We see that while larger $N_S$ implies
smaller $\gamma$ it is not true vice versa. This is because $N_S$ is very sensitive to
the local environment at small $d$s while $\gamma$ is more a global measure.
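
The counting behind the uniform reference distribution can be reproduced with exact integer arithmetic. In the sketch below, a vertex at Hamming distance $d$ from a given vertex of the hyperplane is obtained by swapping $d/2$ core sites with $d/2$ surface sites, which gives the product of two binomial coefficients. The function and variable names are illustrative choices.

```python
from math import comb

N_SITES, N_CORE = 36, 16

n_structures = (28_728 - 119) * 2 + 119  # structures a sequence can fold onto
n_vertices = comb(N_SITES, N_CORE)       # strings with exactly 16 core sites
rho = n_structures / n_vertices          # average density on the hyperplane


def vertices_at_distance(d):
    """N(d): swap d/2 core sites with d/2 surface sites (d must be even)."""
    if d % 2:
        return 0
    return comb(N_CORE, d // 2) * comb(N_SITES - N_CORE, d // 2)


def uniform_neighbors(d):
    """n^0(d) = rho * N(d), the expectation for uniformly spread structures."""
    return rho * vertices_at_distance(d)


print(n_structures, n_vertices)
for d in (2, 4, 6, 8):
    print(d, vertices_at_distance(d), uniform_neighbors(d))
```

As a consistency check, summing `vertices_at_distance(d)` over all even $d$ from 0 to 32 recovers the total number of vertices on the hyperplane.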

What are the geometrical characteristics of the structures in the highly
populated regions and lowly populated regions, respectively? This is something
we are very interested in but know very little about. Naively, the structures
in the highly populated regions are typical random structures which can be
easily transformed from one to another by small local changes. On the other
hand, structures in lowly populated regions are "atypical" structures which
tend to be more regular and "rigid". They have fewer neighbors so it is harder
to transform them to other structures with only small rearrangements. One
geometrical feature of highly designable structures is that they have more
surface-to-core transitions along the backbone, i.e. there are more transitions
between 0s and 1s in the structure string for a highly designable structure than
average [23]. We found a good correlation between the number of surface-core
transitions in a structure string $s$, $T(s)$, and $\gamma(s)$ (Fig. 13b). Thus, a necessary
condition for a structure to be highly designable is to have a small $\gamma$ or a large
$T$.
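
Both quantities are simple to compute from structure strings. The sketch below counts the surface-core transitions $T(s)$ and evaluates $\gamma^2(s)$ in two ways: directly as the variance of the Hamming distance from $s$ to every string in an ensemble, and through the correlation form of Eq. (5). The three-string ensemble is a toy example chosen so that every string has the same number of core sites, which the second form requires; it does not represent real lattice structures.

```python
def transitions(s):
    """T(s): number of surface-core switches along the structure string."""
    return sum(a != b for a, b in zip(s, s[1:]))


def gamma_squared(s, ensemble):
    """Variance of the Hamming distance from s to every string in ensemble."""
    dists = [sum(a != b for a, b in zip(s, t)) for t in ensemble]
    mean = sum(dists) / len(dists)
    return sum((d - mean) ** 2 for d in dists) / len(dists)


def gamma_squared_eq5(s, ensemble):
    """Eq. (5): 4 * sum_ij s_i s_j C_ij, with C_ij from Eq. (6)."""
    n = len(ensemble)
    bits = [[int(c) for c in t] for t in ensemble]
    core = [i for i, c in enumerate(s) if c == "1"]  # only s_i = 1 contribute
    mean = {i: sum(b[i] for b in bits) / n for i in core}
    total = 0.0
    for i in core:
        for j in core:
            corr = sum(b[i] * b[j] for b in bits) / n
            total += corr - mean[i] * mean[j]
    return 4 * total


# The structure string shown with Fig. 11a.
fig11a = "000110000001111111100000110000011110"
print(transitions(fig11a))

toy = ["1100", "1010", "0011"]  # toy ensemble, two core sites each
print(gamma_squared("1100", toy), gamma_squared_eq5("1100", toy))
```

For the toy ensemble both routes give the same value, as they should whenever all strings share the same number of 1s.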

A great advantage of the PH model is that it is simple enough to test some
ideas immediately. Two quantities often used to characterize structures are
the energy spectra $\mathcal{N}(E, s)$ [10,26] and $\mathcal{N}(E, s, C)$ [26]. The first one is the
energy spectrum of a given structure $s$ over all sequences $\{\sigma\}$:

$$\mathcal{N}(E, s) = \sum_{\{\sigma\}} \delta[H(\sigma, s) - E]. \tag{7}$$

The second one is over all sequences of a fixed composition $C$ (e.g. fixed
numbers of H-mers and P-mers in the case of two-letter code), $\{\sigma\}_C$:

$$\mathcal{N}(E, s, C) = \sum_{\{\sigma\}_C} \delta[H(\sigma, s) - E]. \tag{8}$$

It is easy to see that if two structure strings $\{s_i\}$ and $\{s'_i\}$ are related by
permutation, i.e. $s_i = s'_{k_i}$ for $i = 1, 2, \ldots, N$, where $k_1, k_2, \ldots, k_N$ is a
permutation of $1, 2, \ldots, N$, then $\mathcal{N}(E, s) = \mathcal{N}(E, s')$ and $\mathcal{N}(E, s, C) = \mathcal{N}(E, s', C)$.
Thus all maximally compact structures have the same energy spectra Eqs. (7)
and (8). Therefore, structures differ in designability, not because they have
different energy spectra Eqs. (7) and (8) [10,26], but because they have different
neighborhoods in the structure space.
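
The permutation argument can be illustrated by brute force for short strings. The sketch below builds the histogram of Eq. (4) energies for a structure string, either over all sequences as in Eq. (7) or over sequences with a fixed number of H monomers as in Eq. (8), and compares two strings related by a permutation. The function name `spectrum` and the six-site toy strings are assumptions of this example.

```python
from collections import Counter
from itertools import product


def spectrum(s, n_h=None):
    """Histogram of Eq. (4) energies for structure string s over all
    sequences (Eq. 7), or over sequences with exactly n_h H monomers (Eq. 8)."""
    counts = Counter()
    for sigma in product((0, 1), repeat=len(s)):
        if n_h is not None and sum(sigma) != n_h:
            continue  # keep only sequences of the requested composition
        energy = sum(a != int(b) for a, b in zip(sigma, s))
        counts[energy] += 1
    return counts


s1, s2 = "110000", "001001"  # related by a permutation of the sites
print(spectrum(s1) == spectrum(s2))
print(spectrum(s1, n_h=2) == spectrum(s2, n_h=2))
```

Both comparisons hold, so the spectra carry no information about which of the two strings is more designable; that information resides in how the structure strings are placed relative to one another.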



4 Folding Dynamics and Thermodynamic Stability 



Will highly designable structures also fold relatively fast? This question is
addressed in detail in Ref. [27] (see also Ref. [28]). A quantity often used to
measure how much a sequence is "proteinlike" is the Z score,

$$Z = \Delta / \Gamma, \tag{9}$$

where $\Delta$ is the average energy difference between the ground state and all other
states and $\Gamma$ is the standard deviation of the energy spectrum. The Z score was
first introduced in the inverse folding problem [29] and later used in protein
design [30]. It has been shown in the context of the Random Energy Model
[31] that the Z score is related to $T_f/T_g$, where $T_f$ is the folding temperature and
$T_g$ the glass transition temperature [32]. We have found a good and negative
correlation between the folding time and the Z score of the compact structure
energy spectrum [27]. In the context of the PH model (3), for a sequence $h$
and its ground state $s$,

$$\Delta = \sum_i h_i (s_i - \langle s_i \rangle), \tag{10}$$

$$\Gamma^2 = \sum_{ij} h_i h_j C_{ij}, \tag{11}$$

where $C_{ij}$ is given by Eq. (6). So in principle for every structure $s$ one can
maximize the Z score with respect to $h$ to get the "best" or "ideal" sequence
for $s$ that gives the highest Z score, $Z_s$. It is however much easier to obtain a
lower bound $Z'_s$ for $Z_s$ by letting $h = s$: $Z'_s = \Delta'/\Gamma'$ with

$$\Delta' = \sum_i s_i (s_i - \langle s_i \rangle), \tag{12}$$

$$\Gamma' = \gamma/2, \tag{13}$$

where $\gamma$ is given by Eq. (5). In Fig. 14b, the $\Delta'$ for all the 6 x 6 compact
structures are plotted against $N_S$ for the PH model. There is little if any
correlation between $N_S$ and $\Delta'$ for the 6 x 6 PH model. Thus, correlations
between $N_S$ and $Z'$ in this model come mainly from the one between $N_S$ and
$\Gamma' = \gamma/2$ (Fig. 14a). So a large $Z'$ is a necessary but not sufficient condition
for a structure to have a large $N_S$.
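
The lower bound $Z'_s$ can be evaluated directly from an ensemble of structure strings. The sketch below sets $h = s$, computes the energy $-\sum_i s'_i h_i$ of every structure $s'$ in the ensemble, and forms the ratio of the mean gap to the standard deviation, which corresponds to Eqs. (12) and (13). The function name `z_lower_bound` and the three-string toy ensemble are illustrative assumptions.

```python
import math


def z_lower_bound(s, ensemble):
    """Z' = Delta' / Gamma' for the sequence h = s, averaging over ensemble."""
    n = len(ensemble)
    target = [int(c) for c in s]
    bits = [[int(c) for c in t] for t in ensemble]
    # Energy of every structure in the ensemble for the sequence h = s.
    energies = [-sum(h * b for h, b in zip(target, row)) for row in bits]
    mean_e = sum(energies) / n
    ground = -sum(h * h for h in target)  # energy of s itself
    delta = mean_e - ground               # Eq. (12)
    variance = sum((e - mean_e) ** 2 for e in energies) / n
    if variance == 0.0:
        raise ValueError("energy spectrum has zero width")
    return delta / math.sqrt(variance)    # Gamma' = gamma / 2, Eq. (13)


toy_ensemble = ["1100", "1010", "0011"]  # toy strings, two core sites each
z_toy = z_lower_bound("1100", toy_ensemble)
print(z_toy)
```

For the toy ensemble the energies are -2, -1 and 0, so $\Delta' = 1$ and $\Gamma'^2 = 2/3$, giving $Z' = \sqrt{3/2}$.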



5 Summary 



We have demonstrated with simple models that structures are very different 
in terms of their designability and that high designability leads to thermo- 
dynamic stability, "proteinlike" structural motifs, and foldability. Highly des- 
ignable structures emerge because of an asymmetry between the sequence and 
the structure ensembles. Our results are rather robust and have been demon- 
strated recently in larger lattice models [33] and in off-lattice models [34]. 
A broad distribution of designability has also been found in RNA secondary
structures [35]. In RNA, however, the set of all sequences designing a good structure,
instead of forming a compact "Voronoi polytope" as in proteins, forms a
"neutral network" percolating the entire space [35]. It would be interesting
to study the similarities and differences of the two systems. Finally, our
picture indicates that the properties of the proteinlike sequences are intimately
coupled to those of the proteinlike (i.e. the highly designable) structures; the
picture unifies various aspects of the two special ensembles. It also suggests
that understanding the emergence and properties of the highly designable
structures is a key to the protein folding problem.

This work was done in collaboration with Hao Li, Ned Wingreen, Regis Melin, 
Robert Helling, and Jonathan Miller. I am grateful to Jeannie Chen for her 
critical reading of the manuscript. 
Fig. 1. The protein flavodoxin. (a) The atomic model. Different atoms are shown in
different grey scales: oxygen (dark), nitrogen (dark grey), carbon (light grey), and
sulfur (white). Hydrogen atoms are not shown. (b) The backbone of the structure,
which is formed by connecting the $C_\alpha$ atoms of each amino acid along the chain.
(c) The ribbon diagram of the backbone. β-strands are shown in dark, α-helices in grey,
and turns and loops in white.
Fig. 2. The schematic energy landscapes of (a) a protein sequence and (b) a random
sequence. 
Fig. 3. The cumulative numbers of PDB domains, (non-redundant) protein domains,
and folds vs. year. Source: SCOP [4] and Ref. [6]. Courtesy of Dr. Steven Brenner. 
Fig. 4. Representatives of some popular folds.
Fig. 6. (a) Histogram of $N_S$ for the 3 x 3 x 3 system. (b) Average energy gap between
the ground state and the first excited state vs. $N_S$ for the 3 x 3 x 3 system.
Fig. 7. The top structure (a) and an ordinary structure with $N_S = 1$ (b) for the
3 x 3 x 3 system.
Fig. 10. (a) Histogram of $N_S$ for the 2D 6 x 6 model with the MJ matrix, obtained
with 3,990,000 random sequences. (b) $N_S$ from the HP model vs. $N_S$ from the MJ
matrix for 2D 6 x 6 structures.
[Fig. 11 panels. (a) Structure string: 000110000001111111100000110000011110. (b) Horizontal axis: $N_S$, ticks from 200 to 1200.]

Fig. 11. (a) A 6 x 6 compact structure and its corresponding string. A structure is
represented by a string $s$ of 0s and 1s, according to whether a site is on the surface
or in the core (which is enclosed by the dotted lines), respectively. (In fact, two
structures, related by the reverse-labeling symmetry, are shown, corresponding to
the two opposite paths indicated by the two arrows. So the $s$ of one structure is the
reverse of the other.) (b) The histogram of $N_S$ for the 6 x 6 PH model obtained by
using 19,492,200 randomly chosen sequences.
Fig. 12. The sequence and structure ensembles in $N$ dimensions.
Fig. 13. (a) Number of structures vs. the Hamming distance for three structures
with low (circles), intermediate (triangles) and high (squares) designability. Also
plotted is $n^0(d)$ (solid line). (b) The number of transitions between core and surface
sites vs. $\gamma$ for all the 6 x 6 compact structures.